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,S^ • Abstract 

-)— » . 
C^ , We study the problem of estimating the coefficients in linear ordinary 

differential equations (ODE's) with a diverging number of variables when 
the solutions are observed with noise. The solution trajectories are first 
smoothed with local polynomial regression and the coefficients are esti- 
mated with nonconcave penalty proposed by [J]. Under some regularity 
<^ , and sparsity conditions, we show the procedure can correctly identifies 

nonzero coefficients with probability converging to one and the estimators 
for nonzero coefficients have the same asymptotic normal distribution as 
they would have when the zero coefficients are known and the same two- 
step procedure is used. Our asymptotic results are valid under the mis- 
JN I specified case where linear ODE's are only used as an approximation to 

nonlinear ODE's, and the estimates will converge to the coefficients of 
the best approximating linear system. From our results, when the solu- 
tion trajectories of the ODE's are sufficiently smooth, the parametric ^/n 
rate is achieved even though nonparametric regression estimator is used 
in the first step of the procedure. The performance of the two-step proce- 
dure is illustrated by a simulation study as well as an application to yeast 



- 1 — I 
j^ ' cell-cycle data. 



1 Introduction 

Ordinary differential equations are widely used to describe systems in physics, 
chemistry and biology. Many such systems can be described by the initial value 
problem 

m' = F{m,0) ,^^. 

m(Q) — Too 

where m — {mi, . . . ,mp) represents the state of the system. When some 
simple regularity conditions on the smoothness of F are imposed, there ex- 



ists a unique solution of the nonlinear ODE, at least in a small neighborhood 
of zero. Analytical insolvability of nonlinear equations necessitates numerical 
methods to find the solution for the initial value problem. On the other hand, 
the statisticians are concerned with the estimation of the parameters 9 {F is 
assumed known) given noisy solutions Fy = mj{Xi) + eij,j = 1, . . . ,p, observed 
at time points Xi,X2, . . . ,Xni- This problem has been investigated by many 
authors. There exist roughly two classes of approaches. The first approach uses 
classical parametric inference, such as the nonlinear least square estimator or 
maximum likelihood estimator [^. Optimization usually involves an iterative 
process. Starting from fixed initial values mo, it finds the solution of (jl.ip using 
numerical methods such as Euler or Runge-Kutta based on the current parame- 
ter estimates. Similarly, inferences in [Tj] is based on the Bayesian principle and 
the observations are modeled by, for example, Yij ~ N{mj{Xi), ct^), which also 
requires numerically solving the of ODE's. Besides, MCMC should be used for 
posterior computation. If the initial values are unknown, they should also be 
considered as parameters and optimized together with 9 in (jl.ip . The second 
family of approaches which is closely related to ours is to directly minimize 
deviation of m' from /(m, 9). [2^ proposed a two-step method, in which m is 
first estimated from noisy data using cubic splines and then J \\m' — f{m, 9)\\'^ is 
minimized with respect to 9. In this approach, numerical solution of ODE is not 
required and unknown initial values do not add to computational burden. [l8| 
extends this approach using a single step method and optimizes the criterion 
that represents a trade-off between the fidelity to ODE and the data fit. The 
computations for the single-step approach are more involved than the two-step 
approach. 

We consider the simpler linear system of ODE's with a large number of 
coefficient parameters. 

m' = Am ,^ 2) 

to(0) = mo 

where m = (m.i, . . . ,m,p) and A is the p x p coefficient matrix. Note that for 
simplicity we do not include a constant term in the system. The solution of 
this system of ODE's is well known and is determined by the spectrum of the 
matrix A, although a full discussion considering all possible cases is complicated 
when p is large. We also regard the linear ODE's as an approximation to the 
truth so that {m,j} is not necessarily the solution of (|1.2p . In high dimensions, 
linear approximations to nonlinear ODE's make more sense since specification 
of nonlinearity is a much complicated matter. In this case, rrij is not necessarily 
an analytical function as in the linear system, but we will still assume it is 
sufficiently smooth later. 

If we use a nonparametric estimator for the solution as well as its derivative, 
denoted by fh and m', the fidelity to the ODE's can be assessed by 



\\m'-Am\fdx (1.3) 

where 1 1 • 1 1 denotes the Euclidean norm. Obviously each of the p equations can 



be considered and fitted separately and the estimation problem is p dimensional 
instead of p^ when all parameters are considered together. 

Unlike the standard linear regression, even when p is large, there still exists a 
unique solution for the least square problem (|1.3p under mild assumptions. But 
when p is large, either because of a priori beliefs on the sparsity of the matrix 
or due to consideration of interpretability of the resulting model, regularized 



or penalized method is needed. For standard linear regression, Lasso [2]| is 
probably the most popular method that uses the Li penalty 



||y-X/3|p + AV|/3, 



The Li penalty will force some of the coefficients to be equal to zero. Com- 
pared to traditional model selection method using information criteria. Lasso is 
continuous and thus more stable. More systematic theoretical studies on Lasso 
appeared later. ^] showed that Lasso is consistent for prediction, a property 
that was called persistency. Several authors la l23| have shown that Lasso is 



in general not consistent for model selection unless some nontrivial conditions 
on the covariates are satisfied. Even when those conditions are satisfied, the 
efficiency of the estimator is compromised when one insists on variable selection 
consistency since the coefficients are over-shrinked. To address these shortcom- 
ings of Lasso, [j] proposed the smoothly clipped absolute deviation (SCAD) 
penalty which is motivated by taking into account several desired properties 
of the estimator like continuity, asymptotic unbiasedness, etc. They also show 
that the resulting estimator possesses the oracle property, i.e. it is consistent 
for variable selection and behaves the same as when the zero coefficients are 
known in advance. These results are extended to the case with a diverging 
number of covariates in [5|. 24 1 proposed adaptive lasso in the fixed p case 



using a weighted Li penalty with weights determined by an initial estimator 
and similar oracle property followed. The idea behind the adaptive lasso is to 
assign higher penalty for zero coefficients and lower penalty for larger coeffi- 



cients. 



12| studied the adaptive lasso with a diverging number of parameters 
and proposed using marginal regression as the initial estimator under partial or- 
thogonality assumption. Also in the high dimensional case, |ll| showed similar 
oracle properties for the estimator with L-y penalty when < 7 < 1. 

In this paper, we study the asymptotic properties within the framework of 
sparse linear ODE using the SCAD penalty. Since the p equations are considered 
separately, we assume without loss of generality that we only want to estimate 
the first equation by minimizing 



{m[{x) ~ P^m{x)f dx + ^Pa(|/3,|) 



using some estimator for m and its derivatives. The SCAD penalty is defined 

by 

p'AO) = \{I{e <\)+ '^T^^—^IW > A)} for some a > 2 ami 9 > 0. 
[a — 1)A 



Other penalties like adaptive lasso and L^ discussed above will lead to similar 
asymptotic results, although initial estimators are required in those cases. 

The rest of the paper is organized as follows. Section 2 presents our two- 
step procedure using local polynomial regression as the solution estimator. The 
asymptotic property of the estimator is discussed. Under regularity conditions, 
we show the estimated coefficients still have parametric convergence rates even 
with nonparametric regression estimates from the first step plugged in. The 
oracle property is shown. In section 3, we conduct a simulation study to assess 
the finite sample performance and use a real dataset as an illustration of the 
procedure. We make some concluding remarks in section 4. The proofs are 
collected in section 5. 



2 Two-step estimator and its asymptotic prop- 
erties 

2.1 Two-step estimation 

For a general nonlinear system of ODE's (|l.ip . we observe its solution with 
additive noise 

Ky = mj{Xi) + Ey , i = 1, . . . , n, j = 1, . . . ,p„. 

For simplicity of notation and proof, we assume the n observation time points 
{Xi}"^2 9'i'c i.i.d. from a uniform distribution on the interval (0,1). The ob- 
servation times for all p„ variables are assumed to be the same. Although our 
estimator certainly works with different observation times for different variable 
rrij , the above assumption of identical time points makes the proof more trans- 
parent. Note that we consider the case where the number of variables diverges 
with the number of observations for each variable. 

Although the observed noisy solution comes from possibly nonlinear ODE's, 
we use a linear system as an approximation for modeling. The true parameters 
(more precisely, the best approximating parameters) is defined to be 

Ao = argmin/ \\m' {x) - A'm{x)\\^ w{x)dx (2.1) 

^ Jo 

where m = (toi, . . . ,r7ip^) and w{-) is a pre-determined nonnegative weight 
function. We assume that a unique minimizer for (|2.ip exists. Since the min- 
imum is obviously independently defined for each row of A, we only focus on 
the first row and denote it by /3o. Let /3o — {PioiPw)'^ with (320 — 0, where 
/3io is a vector of length fc„ and /320 is a vector of length Pn — fc„. This is the 
usual sparsity assumption used in various papers on high-dimensional penalized 
regression. 

When given only the noisy data Yij, we first estimate ruj using nonpara- 
metric regression. In this paper, we use the local polynomial estimator |6|. In 



local polynomial regression, for a smooth function m{x) with noisy observations 
Yi = m{Xi) + e^, z = 1, . . . , 71, we model m{x) around some point xq by 

-A m'-'^\x) ^ 

m{x)^}_^—j^{x-xor 

d=0 

where m^'^-' is the d-th derivative of m. 

This motivated the minimization of the following objective function 

mmy^iY,-y2adiX,-xYfK{^-^) (2.2) 

i=l d=0 

where a kernel function K{-) with bandwidth h is used for localization. Let 
a = {ai, . . . j&d) be the solution to the problem (|2.2p . The local polynomial 
estimator for m^'^'{xQ) is m^'^'{xo) = dlctd- Note in this paper, even though we 
use the notation m!''^^-) to denote the estimator of the derivatives, it is different 
from the derivative of ?7i(-). 

Denote by X the design matrix 

( 1 (Xi - Xo) 
X^ 



V 1 {Xn- Xo) 



and 



y 



■■ (Ai- 


-a:oj 


•• {Xn- 


- a;o) 


Y, \ 




Y^ ) 





the problem (|2.2p can be written as 

min(y - XafW(%j - Xa). 

OL 

where W is the diagonal matrix with ^( ^'^^° ) for the i-th diagonal element. 
The solution can be written in a closed form 

a = iX^WXy^X^Wy 

Some algebra shows that d^ can also be written as 

for some weight functions Wd^) depending on both xq and Xi. 

After applying local polynomial regression to observations Ky, i = 1, . . . , n 
for each j, j = 1, . . . ,p,i, we estimate the coefficients /3 by minimizing the pe- 
nalized least square objective function 

5(/3) := \ {jnx[x)-fm{x)fw{x)dx + y^vxS\P3\) (2-3) 

(3 = argminS'(/3) 



where A„ is the smoothmg parameter for the SCAD penalty p\(-)- From the 
form of the objective function, (3 only depends on the nonparametric estimators 
through the functionals J mi{x)Thj{x)w{x)dx and J ■mi{x)rh'i{x)w{x)dx,i,j = 
1, . . . ,Pn- Similar functionals are studied by different authors in the context of 
nonparametric density estimation or regression. For estimation of a density say 
/, [10| and [l| discuss estimation of /[/'''^ {x)]'^dx using kernel estimator. l4[l5t 



uses series projection to estimate functionals of more general forms. In the 
context of regression, [3| gives estimator of J[m{x)]'^dx using kernel regression, 
and [13| investigated the estimation of J [m^'^^x)]'^ dx using local polynomial 
regression. 

2.2 Asymptotic properties 

Before we present the first result, we need some notations. Denote the p„ x 
{pn + 1) matrix of integral functionals 



M 



J m[miw J mimiw J mim2W ... J mimp^w 
J m[m2W J m2miw J m2m2W ... J m2mp^w 

J m[mp^^w J mp^miw J mp^m2W ... J "m^p^mp^w 



(2.4) 



The corresponding matrix with local polynomial estimators plugged in is de- 
noted by M. For any p x (p + 1) matrix C, let 

Vech{C) = (cii,C21, . . . ,Cpi,Ci2,C22, . . . ,Cp2,C23, • ■ • Cp+l,p)^ 

be its vectorized version in p(p + 3)/2 dimensions. Thus vech{M) and vech{M) 
contains all the nonrepetitive elements in the two matrices. Let vec{C) be the 
usual vectorization of matrix C in p(j)+ 1) dimensions. Obviously, there exists a 
binary matrix $p such that vec{C) = ^pVech{C). From (j2.4p . we can write the 
matrix of integral functional as M — [b, Q], where fo is a p„ dimensional vector 
consisting of functionals of the form J m[miW and Q is the p„ x p„ matrix 
containing functionals of the form J mimjw. Similarly we write M = [b, Q]. 

Now we can state the regularity conditions for the consistency and oracle 
property of the SCAD penalized estimator. 

(A) The kernel K is a continuous bounded symmetric density function sup- 
ported on [—1, 1]. 

(B) The true solution of nonlinear ODE's, 77ij,i = l,...,p„, is three times 
continuously differentiable, and local polynomial estimator used is of order 

3 = 3. 

(C) The weight function w is bounded and nonnegative, with iw*^*' (0) = w^'-' (1) = 
0,i = 0,l,2. 

(D) The errors e = (e^, . . . , ej^)"^, with ej = (eij, . . . ,e„j)"^ denoting the 
noises associated with rrij, are independent and identically distributed 
as Ar(0,cr2). 



(E) n/i6 ^ ^^nh^ -^ 0, V^Pnih^ + ^)^0- 

(F) The eigenvalues of the matrix Q satisfies 

< Pl„ < Amin(Q) < Aniax(Q) < P2n- 

(G) ^ = o(pi„) = o(p2„), A„ ^ 0, ^^ -. 0, ^^ -. 0. 
(H) The nonzero coefficients, /3io — (/3oi,/3o2 7 • ■ • ,/3ofe„)^ satisfy 

max \Poj\ < C, for some constant C independent of n. 

i<j<fe„ 

(I) 

min |/3ofc|/A„ ^ cx). 
i<fc<fc„ 

Condition (A) is standard for local polynomial regression for estimation of 
curves with its derivatives. Noncompactly supported kernel can be used with 
increased technical complication. Condition (B) ensures that the parametric 
^/n convergence rate is achieved for integral functionals. The main purpose 
of using a weight function w is to address undesirable boundary effect in local 
polynomial regression and to make the proof cleaner. Conditions (G) — (/) are 
used to ensure the consistency and the oracle property of the final estimates 
which is standard in the high-dimensional regression literature. 

Theorem 2.1. (Asymptotic normality of integral functional estimates) Suppose 
that conditions (A)-(E) holds. Then j'^Gn^^'^{vech{M) - vech{M) -i A^(0, 1) 
for any Pn{Pn + 3)/2 dimensional vector 7„ with ||7,i|| = 1, where — > means 
convergence in distribution and Gn is the ^" ^^^ — - x 2ii}£q — l asymptotic co- 
variance matrix of vech(M) which can be obtained from Lemma \5.S\ in section 
5. 

As presented in Lemma l5.2l in section 5, the entries of the covariance matrix 
Gn is of order 0(l/n), thus the rate of convergence for vech{M) is y/n. We note 
that we intentionally presented only the much simplified version of asymptotic 
normality. When functions mj,j — l,...,p„ are not smooth enough, or the 
kernel bandwidth is chosen differently, or a lower order polynomial is used in 
nonparametric regression, it is possible to obtain asymptotic normality with 
slower rates, or with nonvanishing asymptotic bias. When this is the case, the 
following asymptotic results for $ should be modified accordingly. In particular, 
the convergence rate of /3 depends critically on the convergence rate oi vech{M) . 
In Theorem 12.21 we state the existence of a local minimizer in a neighborhood 
of the true parameter. Consistency of the global minimizer can be proved using 
peeling device as demonstrated in [ij, lul . 

Theorem 2.2. (Local consistency) There exists a local minimizer [3 of S{/3) 

2 

such that \\l3 — /3o|| = Op{—fi^ — ), when conditions (A)-(H) holds. 

7 



Theorem 2.3. (Oracle property) Let f3 — (/3i",/3j)'" be the local minimizer as 
stated in Theorem \2.2l where Pi is kn dimensional and [32 is s„ dim,ensional. 
Under conditions (A)- (I), we have 

(i) /3j = with probability converging to 1. 

(ii) For any Pn dimensional vector 7„ with unit norm, 

where 

P„ = [(1, -/3fo) ® Q-'] <i>p„G„<i>J„ [(1, -f3fo) ® Q'Y 

The above theorem states that when n is large, the zero coefBcicnts are esti- 
mated as zero with high probabiUty. The asymptotic distribution of the nonzero 
coefficients is the same as when the zero coefficients are known in advance, if 
the same nonparametric estimates are used in the first step. This fact can be 
seen easily from the proof in section 5 since the proof follows roughly the same 
lines whether or not the zero coefficients are known. Note our oracle property 
is conditioned on the estimates of the solutions from the first step, which is not 
as clean as the oracle property stated in [j|, for example. 



3 Numerical examples 

3.1 Simulation 

First we illustrate some of the computational properties of our estimates with a 
simulation study. The functions m are generated as follows. For n = 50, 100, 200 
time points, we use an even number of variables p„ = 2r„ with r„ — [n^"] 
and [•] denotes the integer part of a number. Note that asymptotically, p„ 
used in the simulation does not tally with the assumptions used for theoretical 
investigations. The observation time points are {1/n, 2/n, . . . , 1}. The p„ x p„ 
coefficient matrix A is generated as follows. 



^2i-l,2j-l — ^2i,2i — ai,^2i-l,2i — " ^2i,2i-l — bi,i — 1, 

Ai_j = all other i,j, 

Oi ^ Uniform{^4:,0),bi ^ Uniform{— 10,10). 

The structure of A has the form 

ai 61 ■ 

-bi ai ■ 

02 62 ■ 

-62 0.2 ■ 



1 ^ni 



A = 



0-r„ 



and the system of differential equations is written in matrix form as 

m' = Am (3.1) 

with ra{x) — {mi{x), . . . ,mp^{x))'^ . 

We generate m{0) from the uniform distribution and solve the initial value 
differential equation problem using the simple Euler's method. The solution is 
evaluated at those n time points and independent normal noise with standard 
deviation a = 0.1 is added at each time point. 

By the data generation mechanism, the evolution of one variable only de- 
pends on the value of itself as well as one other variable. The coefficient values 
Gi are chosen to be negative so that the solution of the differential equations is 
asymptotically stable to avoid numerical problems. 

In our experiment, we use 100 samples for each n — 50, 100, 200 generated 
from model p.ip . The two step estimator with SCAD penalty is applied. We 
use standard cross-validation to choose the bandwidth which leads to good em- 
pirical results. Cross-validation is also used in the second step. We compare 
its performance with another regression procedure in which one directly uses 
the noisy observations as covariates and finite differences as derivatives. That 
is, the model is fitted by solving the following problem (showing only the first 
equation of the linear system) 

/3ts = argminf](^|-l^ -J2^^^^^)' + E^^-^d^^D- (^.2) 

■'=2 J 3 

Since this is similar to a discrete time series model, we call its minimizer the 
TS estimator. The results are shown in Table [U where we consider several es- 
timators: two-step estimator with SCAD penalty (SCAD), two-step estimator 
with no penalty (OLS), two-step estimator using only the two variables with 
nonzero coefficients (ORACLE), ^^ with SCAD penalty (TS-SCAD), ([3^]) 
using only the two covariates with nonzero coefhcients (TS-ORACLE). The av- 
erage mean squared errors (AMSE) shown in the table are the errors for the 
two nonzero coefficients only. We also show the average number of nonzero co- 
efficients for the SCAD estimator. Estimation using (|3.2p produces much worse 
results compared to the two-step estimator which reduces the noise contained 
in the observed solutions. We also see that the number of nonzero coefficients 
selected is close to the true value. 

3.2 Real data example 

Statistical inference of genetic regulatory networks is essential for understanding 
temporal interactions of regulatory elements inside the cells. For inferences of 
large networks, identification of network structure is typically achieved under 
the assumption of sparsity of the networks. The increasing amount of high- 
throughput time course data has provided biologists a window to the under- 
standing of the biomolecular mechanism of different species. The expression of 



Table 1: AMSE for the simulation study. Numbers inside the brackets are the 
corresponding standard errors 



n 


SCAD 


OLS 


ORACLE 


TS-SCAD 


TS-ORACLE 


Average number of 
nonzero coefficients 


50 

100 

200 


2.7 (0.44) 
2.6 (0.46) 
2.9 (0.53) 


5.3 (1.42) 
7.8 (0.77) 
14.5 (3.2) 


2.4 (0.41) 
2.4 (0.73) 
2.1 (0.68) 


6.2 (2.8) 
8.2 (2.4) 
20.7 (5.2) 


7.5 (2.5) 
12.7 (4.3) 
13.0 (3.9) 


2.5 
2.3 
2.9 



genes in these studies are indicative of the dynamic activities occurring inside 
the organism. Such regulatory activities involve complicated temporal interac- 
tions among different gene products, forming genetic networks indicating the 
causal relationships between different elements. 

We demonstrate the performance of the our penalized functional model with 
the application to the cell cycle regulatory network of Saccharomyces cerevisiae. 
The dataset comes from [l9[ which provides a comprehensive list of cell cycle 
regulated genes identified by time course expression analysis. We use the 24 
unequally spaced time points of the cdcl5 synchronized expression data. Same 
as [17|, we consider 20 genes including 4 transcription factors known to be 
involved in regulatory functions during different stages of the cell cycle. 

We apply our approach to this dataset. Only the part of the coefficent matrix 
showing the interactions between each of the 20 genes and four transcription fac- 
tor is presented in Table [2l and we compare the result with known interactions 
retrieved from the YEASTRACT database [20J and treat those as the back- 
ground truth. For this submatrix, we get PPV (positive predictive value)— 0.54: 
and sensitivity— 0.83. Since all statistical models are merely mathematical ap- 
proximations to the true world, it is plausible that automatically chosen model 
undersmoothes the coefficients matrix to provide a better fit to the data. One 
can also manually specify the smoothing parameter in place of cross-validation 
to achieve desired sparsity of the networks. 

4 Conclusions 

In this paper we studied the asymptotic properties of the two-step estimator 
in high-dimensional linear differential equations, when the size of the linear 
system diverges with the density of observed time points. Using local polynomial 
estimates in the first step combined with penalized regression in the second 
step, we have shown that the estimators correctly identify zero coefficients with 
probability converging to one and the estimators with nonzero coefficients are 
asymptotically normal with parametric convergence rates. Since the covariates 
are observed with noise, the situation is similar to the errors-in-variables model 
where pretending the true covariates to be known will lead to estimators that 
are not even consistent. Thus it is crucial that only the smoothed solutions are 
plugged into the least square problem. 

The most severe theoretical restriction comes from assumption (E). With 
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Table 2: The reconstructed network structure with PPV=0.54 and Sensitiv- 
ity=0.83. The interactions retrieved from database are denoted by 'D' and the 
interactions inferred by the model are denoted by 'x'. 
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the choice oi h — 0{n ^/^) for example, the assumption y/npn{h^ + ^^) — > 
imposes the condition p„ = o{n~^'^'^). This condition is used in the proof 
of Theorem 12.11 to show the asymptotic normality of integral functionals. We 
suspect that this condition can be relaxed with more careful calculation. 

Although we only focus on the case where the ^/r^ rate is achieved, this of 
course depends on the smoothness of the solution as well as choice of bandwidth. 
In other situations, it might happen that the integral functionals converge with 
a different rate, which will also slow down the rate of the linear coefficient 
estimates. A comprehensive treatment considering all possible cases is beyond 
the scope of the current paper. 



5 Proofs 

First we investigate the asymptotic properties of the integral functionals / rrii {x)mi {x)w{x)dx 
and J mi{x)'mj{x)w{x)dx. The following Lemmas give asymptotic bias and 
variance of the local polynomial estimators and their proofs are similar to those 
found in [13], which only studied the quadratic functional J[Tn\ {x)]'^w{x)dx. 
We need to introduce more notations before presenting the lemmas. Let Ai 
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be the n x n matrix with the (i, j)-entry auj — J Wo{ '^\^ )Wi{ \^ )w{x)dx. 

Similarly, let yl2 be the nxn matrix with the (i, j)-entry a2ij = J Wo{ ''^'f^^ )Wi{ \^ )w{x)dx. 
The matrix A2 is symmetric while Ai is not. Let Bi denote the symmetrized 
version of Ai, i.e., Bi = {Ai + ^f )/2. 

Lemma 5.1. Under conditions (A)-(C) together with nh — > cx3, conditioning 
on the random time points {Xi}f^i, 



triA,) ^ iC + o,il))-^ 



tr{AiAi) = (C + Op(l)) 



1 

nh'^ 
1 



/i3 



triA2) = (C + o,(l))-i- 
triAjA^) = {C + Op{l)) ^ 



71^ h"^ 
tr(Al) = (C + o,(l))^ 

where in the above expressions, different appearances of C denotes different 
constants depending on K and w. Similar observation applies to the next lemma 
as well. 

Proof. The calculations for tr{A2) and tr{A2) are special cases studied in [13j . 
In particular, the calculations are contained in the proof of their Theorem 4.1, 
equations (7.3) and (7.19). The proofs for all other cases are similar and omitted. 

D 

Lemma 5.2. Let 61 — J m'lmiw and 62 = Jm'im2W, O3 — J mimiw and 
04 — J mim2W. Use 9i, . . . ,64 for the corresponding estimated version with true 
functions replaced with local polynomial estimates. Under the same conditions 
as stated for Lemma \5.1l 

(a) the asymptotic bias is 

E{ei)~e, = {Ci + opii))h^ + {C2 + op{i)) ^ 



E{e2)-e2 = {Ci + oj,{i))h^ 

E{h)-Oz - (Ci+Op(l))/i4 + (C2 + Op(l)) 

£{64) -e^ = {Ci + op{i))h^ 



nh? 

1 
nh 
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(b) the asymptotic variance is 



VariOi) = {C\+Op{l))^ + {C2 + o.p{l))- 

n^n-^ n 

Var{e2) = (Ci+Op(l))^ + (C2 + Op(l))- 
Varies) = (Ci+Op(l))4r + (C2+Op(l))- 
Varie^) = (Ci+Op(l))47- + (C2+Op(l))- 



(c) similarly, we can calculate the covariances. For example, 

Covi9i,92) = (Ci + Op(l))^ + (C2 + Op(l))- 

The above biases and variances are implicitly conditioned on the random, time 
points {Xi}f^^. 

Proof. The calculation follows that of [l3| and we refer the reader to that pa- 
per for details, giving here only some short explanations of the proof as well 
as pointing out the differences when dealing with non-quadratic forms which 
were not studied in [l3|. For ease of notation, within the current Lemma, we 
let y'^ = (Fi, . . . , Yn), and Z'^ = (Zi, . . . , Z„) be the noisy observations for 
functions mi and TO2 respectively, with additive noise ei = (en, . . . ,e„i) and 
£2 — (ei2, • ■ • , en2)- 

Obviously we have 9i = Y^BiY = mjBimi + 2miBi€i + ejBiei. Given 
{Xi}, the conditional expectation is m^ Bimi +a'^tr{Bi). m^ Binii contributes 
to the Op{h^) term in the bias and tr{Bi) = Op(;^). For §2 == Y'^AiZ = 
miAim2 + miAie2 + m^Ajei + eiAie2, since the noises ei , €2 are independent, 
the bias only comes from mfBim2 = Op{h^). 

The conditional variance of Y^ BiY is Aa^niiBimi + 2aHr{B\), where the 
first term comes from the variance of 2miBiei and the second term comes from 
variance of eiBiei. Calculations show that the first term is of order Op ( ^^yp- -|- i ) 
while the second term is Op(;jTp). All other expressions are derived in the same 
way. 

D 

Proof of Theorem \2.1[ We want to show the asymptotic normality of vech(M). 
For convenience, we denote the components of uec/i(M) hyuk,k — 1,2, . . . ,PniPn+ 
3)/2 and the components of vech{M) by Uk- From the proof of Lemma [5T2l one 
can see that Uk is of the form Uk = {mi + ei)^C{mj + ej) with either C = Ai or 
C = ^2 and J possibly equals j. Thus Wfc — miCmj+mfCej+mJC^ei + efCej. 
By the proof of the lemma, the first term is Uk + Op{h^ + jj^) and the last term 

is Op ( \/(ir(C2) ) = Op{\ j^j^). This is the pseudo-quadratic situation as de- 
scribed in [l3| since the linear term, which is of order Op (-7=), dominates the 

13 



quadratic term. Thus we have vech{M) — vech{M) — V + R, where each com- 
ponent of y is a hnear combination of the errors and of order -^ due to the 

calculation of Lemma [5T21 and each components of R is of order Op{h^ + ;^)- 
Let the asymptotic covariance matrix of vech{M) — vech{M) be denoted by 

1/9 

Gn, which is the dominating term in Lemma 15.21 (b). Obviously ^nGn V 
has an asymptotic standard normal distribution. The same will be true for 
7nGn^^^wec/i(M) - vech{M)) ifjnGn^^^R = Op(l). Since ||G;;:^/^|| is of oder 

Op{\/n), we have 7„G^ ' R = Op{^/npn{h'^ + ;^))- The results follow from 
assumption (E). D 

The following lemma bounds the eigenvalues of Q. 

Lemma 5.3. Ifpn/Vn = o(pi„) and pn/y/n = o(p2n), then Amax(Q)//32n = 
Op(l) and pin/KiiniQ) = Op(l). 

Proof. By the Gershgorin Circle Theorem ([8| Theorem 7.2.1), the eigenvalues 
of Q — Q lie inside the interval [— G-^, +G-^] with high probability for large 

enough constant G. By the assumption, we have Amin(Q) ^ Aniin(Q) ~ G^ > 

pi„/2. Same proof applies to for Amax(Q)- 

D 

Proof of Theorem \2.2l Let t„ = C—^ — . Following [El, we will show that for 
any e > we can find a large enough constant G such that 

P{ sup 5(/3o + T^u) > S{po)} > 1 - e. (5.1) 

11^11=1 

Simple calculations show that 

SiPo + Tnu)-SiPo) > -2TnU^ib-QPo) + T^u^Qu (5.2) 

+ X] p^" (^oj + '^""j ) ~ X! ^^" ^^oj' ) • 

i=i j=i 

Since (3q minimizes J{rn'i{x) — f3'^m{x))'^w{x)dx, we have the normal equation 
QPo — b. Thus we can bound the first term on the right hand side of (|5.2p as 

\2TnU^{b-Q(3o)\ = \2TnU^ib^b-iQ-Q)Po)\ 

= Opir^^). 

By Lemma 15.31 the second term t^u^Qu can be bounded below by Op(r^pi„). 
Thus the second term dominates the first term when G is large enough. Same 
as (5.5) and (5.6) in ^] the contribution of the penalty terms are also dominated 
and (|5.ip is proved. D 



To make the proof of Theorem 12.31 less cluttered, we show the variable se- 
lection consistency in a separate Lemma. 
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Lemma 5.4. Under the assumptions of Theorem ] 2. 31 the local minimizer found 
in Theorem \2.2\ with (3 — {Pi ,02^' 's consistent in variable selection: (32 = 
with probability converging to one. 

Proof. For the objective function S{(3) defined in (|2.3p . we have 

— = 2iQP-b)+p'^J\P,\)sgniP,). (5.3) 

2 

Consider j — Ain + 1, ■ • • , J>n. When < \f3j\ < C—^ — , we can bound 

WQd-hW = ||Q/?o-fo + Q(/3-/3o)|| 

= ||(Q-Q)/3o-(6-6) + Q(/3-/3o)|| 

< IIQ-QIIII/3o|| + ||&-6|| + ||g||||/3-/3o|| 

< -^ + ^=+P2n^= - (^p( !=)■ 

V" V '^ V"Pln Pin V " 

2 2 

^^'^'^'^ V^pr„A„ ^ '^' ^"^^ I^jI - ^vfc' '^'^ ^'^'^'^ liminfPA„(l/?jl)/^" > by the 
form of the SCAD penalty. So the penalty term in (|5.3|) dominates. Thus 

dS ^ ,, r„„ n . ^ . ^ Pn 



-- > for < /?j < C- ^ 

OPj \/npin 

^ < for > /?, > -C-£^ 

OPj V"-Pln 



which implies that the minimum is achieved at exactly (3j = Q. D 

Proof of Theorem \2.3[ Now that part (i) has been proved in Lemma 15. 4i we 
focus on asymptotic normality. Since P2 — with probability converging to 
one, we can concentrate on Pi and denote it simply as /3. First, by assumptions 
(G), (I) and Theorem [23 VpA„(/3i) = 0. From the fact that VS0) = 0, it 
follows that 

-b + Qp = 0. 

Together with —b + Qj3q = 0, we have 

-{b -b) + {Q- Q)(3o + Q{f3 - (3o) = 0, 

or 

P-Po = Q"M(^-^)- (<3-Q)/3o]+ higher order terms 

1 



= Q-\M-M), 
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Since 



-'(M-M)^) . .,„(Q-'(M-M)(_;j) 



[{1,-I3l) ® Q-^]vec{M - M) 

[(1, -/?;[) (g) Q-i]$p„^ecft(Af - M) 



the conclusion follows directly from Theorem 1 2. II D 
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